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Abstract 

Background: Evidence suggests that common complex diseases may be partially due to SNP-SNP interactions, but 
such detection is yet to be fully established in a high-dimensional small-sample (small-n-large-p) study. A number 
of penalized regression techniques are gaining popularity within the statistical community, and are now being 
applied to detect interactions. These techniques tend to be over-fitting, and are prone to false positives. The recently 
developed stability least absolute shrinkage and selection operator ( s LASSO) has been used to control family-wise error 
rate, but often at the expense of power (and thus false negative results). 

Results: Here, we propose an alternative stability selection procedure known as stability smoothly clipped absolute 
deviation ( S SCAD). Briefly, this method applies a smoothly clipped absolute deviation (SCAD) algorithm to multiple 
sub-samples, and then identifies cluster ensemble of interactions across the sub-samples. The proposed method was 
compared with s LASSO and two kinds of traditional penalized methods by intensive simulation. The simulation 
revealed higher power and lower false discovery rate (FDR) with S SCAD. An analysis using the new method on the 
previously published GWAS of lung cancer confirmed all significant interactions identified with s LASSO, and identified 
two additional interactions not reported with s LASSO analysis. 

Conclusions: Based on the results obtained in this study, S SCAD presents to be a powerful procedure for the detection 
of SNP-SNP interactions in large-scale genomic data. 

Keywords: Genome-wide association study (GWAS), Interaction, Least absolute shrinkage and selection operator 
(LASSO), Penalized logistic regression, Smoothly clipped absolute deviation (SCAD), Stability selection 



Background 

High-dimensional genomic data are becoming increas- 
ingly available to assist in the identification of genetic 
factors for complex diseases such as lung cancer. In par- 
ticular, genome-wide association studies (GWAS) have 
implicated a variety of genetic variants in many diseases, 
while only a small fraction of phenotypic variation was ex- 
plained by those. This suggests that multi-locus gene-gene 
or gene-environment interactions must be considered [1]. 

Gene-gene interactions could be detected using a variety 
of methods [2] . For example, multifactor dimensionality re- 
duction (MDR, [3]) is a non-parametric and model-free 
method that does not require any assumption of genetic 



* Correspondence: fengchen@njmu.edu.cn 

'Department of Epidemiology and Biostatistics and Ministry of Education 
(MOE) Key Lab for Modern Toxicology, School of Public Health, Nanjing 
Medical University, Nanjing, China 

Full list of author information is available at the end of the article 



mode of inheritance. However, MDR is inefficient in hand- 
ling large scale genetic datasets [4]. Penalized regression 
methods such as least absolute shrinkage and selection op- 
erator (LASSO) [5] and smoothly clipped absolute devi- 
ation (SCAD) [6] are also widely used for high-dimensional 
data. LASSO is a useful tool for detecting gene-gene inter- 
actions with a broad range of simulations [7]. SCAD 
penalty has an oracle property, and thus it is more consist- 
ent with the actual effects than LASSO [6]. The cross- 
validation is usually used for the choice of the amount 
of regularization in penalized regression methods (e.g., 
LASSO and SCAD), but it often includes too many noise 
variables. In an attempt to minimize such a problem, a 
modified LASSO penalized method, stability LASSO 
( s LASSO), has been proposed to unify optimal shrink- 
age and variable selection in GWAS ([8]). Stability 
selection controls false discovery rate and renders cross- 
validation practically unnecessary. Alexander and Lange 
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(2011) claimed that sLASSO could accurately identify the 
most important regions of GWAS, but in a simulation 
study s LASSO offers less power than the simpler and less 
computationally intensive methods of marginal association 
testing [8]. 

It has been shown that the LASSO penalty could pro- 
duce a bias even in the simple regression setting due to 
its linear increase of penalty on regression coefficients. 
To remedy this bias issue, a non-concave penalty such 
as SCAD penalty was proposed. SCAD has the so-called 
oracle property, meaning that, in the asymptotic sense, it 
performs as effectively as if an analyst had known in ad- 
vance which coefficients were zero and which ones were 
nonzero [6]. SCAD is capable of achieving the sparse 
estimator in combination with smaller biases in linear re- 
gression than LASSO. Here, we propose a new stability se- 
lection procedure in combination with SCAD penalization 
( S SCAD). The new method was compared to s LASSO 
using systematic simulations and a published GWAS study. 

Methods 

Ethics statement 

This collaborative study was approved by the institutional 
review boards of China Medical University, Tongji Medical 
College, Fudan University, Nanjing Medical University, and 
Guangzhou Medical College with written informed consent 
from all participants. 

Penalized logistic regression for case-control GWAS 

Let ji denote the disease status of the individual i ( i =1, ...,«): 
1 for case and 0 for control. The SNP of individual i, Xy, 
is formatted as the count of a particular allele (0, 1, or 2) 
where j=l,...,m. The logistic model below includes SNP- 
SNP interaction terms: 

j ; ~Binominal(l, 7T;), 

71- m 
log ~~ = /S 0 + Pff) + ZjkXijXik, i = 1, 
1 Ui ;=1 j<k 

(1) 

where Xy and Xipc ik are main effect and interaction fea- 
tures, respectively. 

Penalized likelihood method makes the fitting of a lo- 
gistic model with small-n-large-p computationally feas- 
ible. It also provides a mechanism for feature selection. 
L(G) denotes the likelihood function of the above logistic 
model (1), where 6 consists of those fi and £. The penal- 
ized log-likelihood function takes the form 

l p (9) = -2logL(6) + J2Px(0j), (2) 

where p\{') is the penalty function characterized by a 
tuning parameter \. The following penalty functions are 
used in LASSO and SCAD, respectively: 



LASSO penalty : p x (Of) = X\dj\, 

SCAD penalty: p\{\9\) = \{1{\0\<X) + ( ^ l(|fl| > A)}, 

where a is a fixed constant larger than 2, the notation (•)+ 
stands for the positive part, and l(-) denotes the indicator 
function. 

When the penalized logistic regression model is fitted, a 
predetermined number of the components of 0 can be 
forced to zero by tuning A to a certain value. For a specific 
variable, estimation of the coefficient is non-zero if the co- 
efficient exceeds the threshold or equals to zero. Thus the 
selection of tuning parameter is a crucial step at the appli- 
cation of penalized likelihood. This is usually accomplished 
with cross validation. We used cross validation predictive 
area under the ROC curve to choose the appropriate tun- 
ing parameter. 

LASSO and SCAD with cross -validated tuning param- 
eter selection often lead to the inclusion of too many noise 
variables for high-dimensional data [9]. For variable selec- 
tion in small-n-large-p genomic data, choosing the amount 
of regularization is more challenging than predicting where 
a cross-validation scheme can be used. A false variable in 
variable selection may lead to apparent association with a 
disease phenotype either through chance or correlation 
with the true variables. Studies using high-dimensional 
data often need to be validated due to false variables. An- 
other practical issue here is reducing false variables while 
maintaining the power to detect relevant variables. To 
address this problem, Meinshausen and Biilmann [9] pro- 
posed a stability selection procedure that is relatively 
insensitive to the choice of tuning parameter [9] . 

In the current study, SCAD was used in variable selec- 
tion in each sub-sample, and then stability selection was 
used to identify consensus ensemble of solutions. 

Stability selection procedure 

a) Meinshausen and Biilmann (MB) stability selection 
methodology 

Stability selection seeks to identify the non-zero 
entries S = {k:9k * 0} of a sparse parameter vector in 
above penalized logistic regression model (2). 
Assuming that the set / is a uniform random 
sub-sample of the index set {1,. ..,«}, the index set was 
used to subsample from the data to yield a 
subset Z(I). For the subset and a given regularization 
parameter X e A, penalized regression procedure was 
used to yield an estimate of 9 k , i.e., 9^(1). Selection vari- 
able set was denoted as 

S x {I) = {k:9\*0). 
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The conditional selection probability of the k-th 
covariate was defined as 



>{/ce5 A (/)|X,j}. 



(3) 



The selection probabilities were naturally estimated 
by Monte Carlo method averaging over B times 
independent sub-sampling. Variables with high 
selection probabilities were retained, while those 
with low selection probabilities were discarded. 
For a cut-off n thr with 0 < n t h r < 1 and a set of 
regularization parameters A, the stable selection 
variables set was defined as 



^stable 



k : maxn, >77>i,,. 



(4) 



The basic idea of the stability selection is to repeat 
the feature selection process in many randomly 
perturbed subsamples (e.g., by bootstrapping the 
samples in the original data set), and to include 
features that are relevant to majority of the 
subsamples. The baseline of the stability selection 
procedure is explained below: 



Given a cut-off, compute the stable selection 
variables set S staMe = h : max fl^TT^ } . 
The threshold value Ji thr is a tuning parameter 
whose influence is very small. In principle, the 
tuning parameter of MB is based on the following 
theorem 1 of Meinshausen and Biilmann. 



Theorem 1 (error control). Assuming that the 
distribution of {Ir^ykeN} is exchangeable fori 
e A, and the original procedure is not worse than 
random guessing. Let q A be the average number of 
selected variables, q A = E\ U S the expected 

AeA 

number V of falsely selected variables is then 
bounded for n thr & (1/2, 1] by 



E(V)< 



1 



27Tffe--l P 



(5) 



b) Improvements of the MB stability selection 
In the current study (where p S> «), the primary 
goal was controlling the false discovery rate (FDR): 



Input: Data set (X,y) , B times independent sub-samples, a set of regularization parameters A 

Output: Selection probabilities (n^) i=1 p /!(EA and stable selection variables set S s,ahle 
for ie {l,...,B} do 

Draw a subsample of {1, ■ ■ ■ n} with a size [n/2] without replacement, denoted it by /; 
for Ag A do 

Run a variable selection algorithm on / with the regularization parameter A ; 

Save the active set 0 k (I) ; 

end 
ehd 

for k e {\,...,p} do 
for A.E A do 

Compute the selection probability flf = P{k 6 S X (I) | X, y} ; 

end 
end 
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FDR = E(V/\ S stable \)«E(V) /q A 
1 ?A 

2n t hr-^ m + m[m-X) jl 



1 



Ik 



2n thr -l p 



(6) 



An advantage of the stability selection is that the 
choice of the regularization parameters A does not 
have strong influence on the results, as long as X is 
varied within a reasonable range [9] . To control 
FDR, we first chose a fixed regularization region A, 
and then chose the selective probability threshold 
n t f, r according to the above inequality (6). 
We set a fixed regularization region as A = [X^, X max ], 
which was decided by the number of selected variables 
q as follow: X max corresponded to the variable that first 
entered the regularization path and X min was chosen 
such that the first q variables that appeared in the 
regularization path, mathematically, X min was chosen 
such that IUa^jca^a^o | < q. The value of q was 
chosen a priori to yield a non-trivial bound (see 
discussion on the paper by Meinshausen and 
Buhlmann [9]), i.e. q= 0(^/{2n thr -l)p)). The 
choice of q in stability selection does not have a strong 
impact on the FDR [9]. We used a conservative 
estimate of q (the square root of the number of 
predictors) in the discovery stage. 
For the fixed regularization region, we applied 
the SCAD procedure to every subsample. 
q A = E\U S X (I)\ was estimated via the Monte 
Carlo simulation averaging over B times 
independent sub-sampling. The threshold value 
Jithr was solved while maintaining FDR < a 
according to the expression (6) as 



-M)/2, if q A <pa. 
pa J 1 



(7) 



Unfortunately, given the nature of genetic data, the 
exact hypotheses required by the theorem of 
Meinshausen and Biilmann are unlikely to hold [9] . 
In particular, the exchangeability assumption of 
Theorem 1 about the indicator random variables 
{lr^ggn , £eAfj is questionable due to the 
correlations among the markers induced by linkage 
disequilibrium. We worried that the false positives of 
stability selection might be grossly wrong in our 
genetic data. So we adopted the method described in 
Alexander and Lange [8] to make a rough check on 
the false discovery rate of stability selection. We 
randomly permuted the phenotype vector y for all 
participants, firstly. We then performed the stability 
selection procedure on the permutation sample and 
obtained the selection probability of the variable 
corresponding to the maximum test statistic in the 
association analysis, and finally compared the 



selection probability with the cut-off calculated from 
the theorem of Meinshausen and Biilmann. 

Data simulation 

S SCAD selection procedure was compared with LASSO, 
SCAD, and s LASSO under a variety of interaction models. 

Genotype simulation 

HAPGEN (v2.2.0) program [10] was used to simulate 
genotype information. The simulation parameters for 
SNP frequencies and variance structure were extracted 
from HapMap3 JPT + CHB that includes SNPs located 
within ±20-kb of ABCC4 (ATP-binding cassette sub- 
family C member 4) at 13q31. The legend file for the 
SNP markers, and the fine scale recombination rate were 
downloaded from the HapMap website (http://hapmap. 
ncbi.nlm.nih.gov/downloads/index.html.en). After quality 
control, 327 common SNPs remained (with the exclu- 
sion criteria: missing data of SNPs > 5%, minor allele 
frequence < 5%, Hardy- Weinberg p-value < 10' ). The link- 
age disequilibrium (LD) pattern is shown in Figure 1. 

Phenotype simulations 

Genetic interaction model was applied for case/control 
phenotypes simulations. During the phenotype simulation, 
we took m =327 (C 2 m = 53301 two-way interactions) and 
randomly selected causal SNPs from different haplotype 
blocks. The blocks were computed through Haploview 
v4.2 by standard expectation-maximization algorithm [11], 
which partitioned the region into segments of strong link- 
age disequilibrium (LD). A total of 26 blocks were gener- 
ated with a minimum size of 2 markers and a maximum 
size of 64 markers. All the causal SNPs and SNP-SNP in- 
teractions were assumed to improve the risk (OR = 1.3, 1.4, 
and 1.5, respectively); wherein we let y, denote the pheno- 
type value of subject i, which is obtained according to the 
logistic regression model (1). 

We conducted simulations to evaluate selection per- 
formance of the LASSO, SCAD, s LASSO and S SCAD 
procedures under the following scenarios: 

(A) . The interactive SNPs have no main effects: 

Bj = 0 for all /, and * 0 for some randomly 
chosen /, k. 

(B) . Only one SNP in the SNP-SNP interaction pair has 

a main effect: 

£jk * 0 and 8j * 0 for some randomly chosen j, k. 

(C) . Both interactive SNPs have main effects: 

fyk *■ 0, Bj * 0 and fik * 0 for some randomly 
chosen /, k. 

The odds ratio parameters are shown in Table 1. Since 
there are only a few etiological loci - only a few of the 
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Figure 1 Pairwise r among the 327 SNPs across the gene ABCC4. The color of each box signifies the value of r between SNPs alleles, with 
the black indicating strongest relationship between a pair of marks (1 = black, 0 = white). 



coefficients in the model are nonzero - the phenomenon 
is referred to as being sparse. 

For every simulation scenario of phenotype in Table 1, 
the phenotype y t was generated based on the simu- 
lated SNPs by HAPGEN 2 using the above-mentioned 
logistic regression model (1). We simulated the popu- 
lation with an equal number of cases and controls 
(n/2 = 10,000) with 200 replicate data sets, and then 
1,000 cases and 1,000 controls were randomly sampled 
from the population to form one sample set. Next, we 
performed different variable selection methods for each 
sample set. 

Simulated data analysis 

R software (version 2.14.0, The R Foundation for 
Statistical Computing) was used to perform the simu- 
lation. The package "glmnet" and modified package 
"ncvreg" were used for LASSO and SCAD analysis, 
respectively. For stability selection, we chose B = 500 
times independent sub-samples with a size of 500 
cases and 500 controls from each 1000-1000 cases- 
controls sample set. 



Application 

The study subjects were from an ongoing two-center 
(Nanjing and Beijing, China) GWAS of lung cancer in 
China. At recruitment, written informed consent was ob- 
tained from each subject. The study was approved by the 
institutional review boards of each participating institution. 
The details of population and other related information 
were described previously [12]. A systematic quality con- 
trol procedure was applied for both SNPs and individuals. 
SNPs were excluded if they did not map on autosomal 
chromosomes, with minor allele frequency < 0.05, with call 
rate < 95%, with p < 1 x 10' 5 for Hardy-Weinberg equilib- 
rium in combined samples of two studies or p < 1 x 10" 4 in 
either the Nanjing or Beijing study samples. We removed 
samples with a call rate of < 95%, ambiguous gender, famil- 
ial relationships, extreme heterozygosity rate, and outliers. 
Briefly, there were 1,473 cases and 1,962 controls in the 
Nanjing center, 858 cases and 1,115 controls in the Beijing 
center after quality control. 

A multi-stage strategy is often used for detecting inter- 
actions on a genome-wide scale. The method pro- 
posed in the current study could not be directly 



Table 1 Parameter settings of the different kinds of scenarios 

Feature of simulated set Scenario Number of nonzero Number of nonzero Locations of causal variants Designed OR 

main effects interactions 

0 1 SNP33 x SNP197 1.5/1.4/1.3 

1 1 SNP18 + SNP18 X SNP134 1.5/1.4/1.3 

2 1 SNP33 + SNP1 34 + SNP33 X SNP1 34 1.5/1.4/1.3 



ABCC4 (327 SNPs, 53301 A1/A2/A3 
two-way interactions) B1/B2/B3 



C1/C2/C3 



Abbreviations: ABCC4 ATP-binding cassette sub-family C member 4, OR odds ratio, SNP single nucleotide polymorphisms. 
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applied to genome-wide scale SNPs data since it is too 
computationally intensive to exhaustively search for all 
SNP pairs. A filtering method could be helpful, as 
explained below using the achPathway pathway (a role of 
nicotinic acetylcholine receptors in the regulation of 
apoptosis). This pathway is one of the top pathways asso- 
ciated with lung cancer risk in the Han Chinese popula- 
tion. Several studies have shown that the nicotinic 
acetylcholine receptors can induce cell proliferation as 
well as angiogenesis [13]. The achPathway pathway in- 
cludes the genes PIK3R1, PTK2B, PTK2, AKT1, PIK3CG, 
FASLG, MUSK, CHRNG, RAPSN, BAD, FOX03, TERT, 
CHRNB1, PIK3CA, SRC and YWHAH. All SNPs are 
mapped into genes within 20 kb downstream or upstream. 
All together, there are 344 common SNPs. We conducted 
an exhaustive search ( C\^ = 58996 ) of two-way inter- 
action in the pathway. Covariates including age, gender, 
pack-year of smoking, and the first two principal compo- 
nents, which have been proposed to sufficiently adjust for 
population stratification derived from EIGENSTRAT 3.0 
[14], were adjusted in the stability selection procedure [12]. 

To increase confidence in the selection of significant in- 
teractions from tens of thousands of SNP pairs, interac- 
tions findings often need to be replicated in independent 
studies. We adopted a two-stage strategy in the current 
study. In the initial discovery stage, we used s LASSO and 
S SCAD to select significant SNP-SNP interactions using 
the data from the Nanjing center. In the replication stage, 
the findings in the initial step were validated using the data 
from the Beijing center with s LASSO and S SCAD. A slight 
variation was made to calibrate the significant threshold 
for the replication phase (i.e., we set the initial fixed num- 
ber of variables in Beijing study as the number of selected 
variables in the discovery stage). 

The SNP pairs were selected using the following criteria: 
(i) the interaction had the selection probability 77/ > n t h r i m 
the Nanjing study, while in the Beijing study the selection 
probability was 77/ > n th ^2 (nthri and 77^,2 are the significant 
thresholds of the Nanjing and Beijing studies, correspond- 
ing to the control of the FDR under 0.1); (ii) the Nanjing 
and Beijing centers both demonstrated identical direction 
of odds ratios for the two SNPs, with their interaction de- 
rived from penalized logistic regression. 

Results 

Result of simulation 

We evaluated the performance of different variable se- 
lection procedures using four established statistical in- 
dexes, including the true positive rate (TPR): 



MCC = 



TP x TN-FP x FN 



TPR 



TP 



TP + FN 



yf{TP + FP){TP + FN) ( TN + FP)(TN + FN) 
the estimated area under the ROC curves (AUC) [16]: 



1 'o P 

auc = —^—y y 



I(Pu > Pv)+l 2 I iPu= Pv) 



and the estimated false discovery rate (FDR, [17]): 
FP 

FDR 



FP+TP' 



where TP and TN stand for true positives and true nega- 
tives, FP and FN stand for false positives and false nega- 
tives, respectively. p u is the selection probability of the 
w-th predictor and the first t 0 variables are assumed to 
be true signals. The index TPR is known as sensitivity, 
whereas MCC is generally regarded as a balanced meas- 
ure for both sensitivity and specificity, AUC summarizes 
overall prediction performance, and FDR is a criterion to 
measure and control the number of false positives for 
the class-skewed high-throughput data [18]. The indexes 
TPR, MCC and AUC are used to measure the power of 
detecting interactions, while FDR is primarily used to as- 
sess false positives of detection. 

The simulation results of different procedures for the 
three kinds of scenarios are summarized in Tables 2, 3 
and 4. All indexes are presented as average and standard 

Table 2 Selection performance of different methods in 
different scenarios (OR= 1.5) 





Method 




Scenario 








Al 


B1 


C1 


TPR 


LASSO 


0.768(0.027) 


0.775(0.020) 


0.857(0.018) 




SCAD 


0.801(0.025) 


0.775(0.021) 


0.603(0.016) 




sLASSO 


0.750(0.018) 


0.780(0.009) 


0.990(0.005) 




sSCAD 


0.933(0.013) 


0.897(0.002) 


0.968(0.003) 


MCC 


LASSO 


0.292(0.014) 


0.384(0.012) 


0.356(0.021) 




SCAD 


0.350(0.013) 


0.435(0.013) 


0.373(0.016) 




sLASSO 


0.776(0.011) 


0.706(0.012) 


0.875(0.011) 




sSCAD 


0.826(0.008) 


0.843(0.016) 


0.837(0.015) 


AUC 


LASSO 


0.593(0.018) 


0.601(0.002) 


0.632(0.002) 




SCAD 


0.596(0.018) 


0.606(0.002) 


0.632(0.002) 




sLASSO 


0.612(0.001) 


0.616(0.002) 


0.638(0.005) 




sSCAD 


0.631(0.001) 


0.615(0.001) 


0.630(0.002) 


FDR 


LASSO 


0.859(0.013) 


0.827(0.007) 


0.789(0.026) 




SCAD 


0.837(0.009) 


0.813(0.008) 


0.742(0.018) 




sLASSO 


0.1 86(0.006) 


0.160(0.009) 


0.107(0.017) 




sSCAD 


0.167(0.003) 


0.147(0.007) 


0.166(0.014) 



the Mathhews correlation coefficient (MCC) [15]: 



Abbreviations: TPR true positive rate, MCC Matthews correlation coefficient, 

AUC area under the ROC curve, FDR false discovery rate. 

Numbers in each cell represent mean (standard error) by 200 times simulation. 
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Table 3 Selection performance of different methods in 
different scenarios (OR = 1.4) 





Method 




Scenario 








A2 


B2 


C2 


TPR 


LASSO 


0.726(0.028) 


0.726(0.021) 


0.797(0.020) 




SCAD 


0.750(0.026) 


0.740(0.021) 


0.734(0.022) 




sLASSO 


0.706(0.019) 


0.720(0.012) 


0.942(0.010) 




S SCAD 


0.861(0.016) 


0.890(0.004) 


0.953(0.002) 


MCC 


LASSO 


0.240(0.015) 


0.334(0.013) 


0.309(0.013) 




SCAD 


0.295(0.014) 


0.380(0.01 5) 


0.333(0.014) 




sLASSO 


0.716(0.012) 


0.636(0.015) 


0.826(0.013) 




S SCAD 


0.787(0.009) 


0.800(0.01 8) 


0.817(0.018) 


AUC 


LASSO 


0.537(0.018) 


0.542(0.002) 


0.577(0.003) 




SCAD 


0.538(0.019) 


0.529(0.003) 


0.570(0.004) 




sLASSO 


0.557(0.002) 


0.569(0.003) 


0.590(0.003) 




sSCAD 


0.557(0.003) 


0.561(0.003) 


0.560(0.001) 


FDR 


LASSO 


0.868(0.014) 


0.829(0.010) 


0.797(0.008) 




SCAD 


0.838(0.009) 


0.817(0.010) 


0.754(0.009) 




sLASSO 


0.198(0.007) 


0.146(0.009) 


0.122(0.010) 




sSCAD 


0.166(0.004) 


0.149(0.007) 


0.180(0.007) 


Abbreviations: TPR true positive rate, MCC Matthews correlation coefficient, 
AUC area under the ROC curve, FDR false discovery rate. 

Numbers in each cell represent mean (standard error) by 200 times simulation. 


Table 4 Selection performance of different methods in 
different scenarios (OR = 1 .3) 




Method 




Scenario 








A3 


B3 


C3 


TPR 


LASSO 


0.681(0.032) 


0.664(0.023) 


0.777(0.021) 




SCAD 


0.715(0.028) 


0.678(0.023) 


0.7853(0.022) 




sLASSO 


0.651(0.019) 


0.692(0.012) 


0.874(0.012) 




sSCAD 


0.835(0.014) 


0.808(0.005) 


0.878(0.005) 


MCC 


LASSO 


0.173(0.018) 


0.298(0.018) 


0.303(0.016) 




SCAD 


0.253(0.016) 


0.350(0.017) 


0.312(0.014) 




sLASSO 


0.676(0.012) 


0.600(0.015) 


0.768(0.01 3) 




S SCAD 


0.736(0.012) 


0.737(0.014) 


0.778(0.019) 


AUC 


LASSO 


0.506(0.018) 


0.524(0.003) 


0.536(0.006) 




SCAD 


0.517(0.021) 


0.514(0.006) 


0.536(0.006) 




sLASSO 


0.518(0.002) 


0.545(0.005) 


0.535(0.005) 




sSCAD 


0.537(0.005) 


0.541(0.002) 


0.551(0.002) 


FDR 


LASSO 


0.869(0.015) 


0.838(0.011) 


0.798(0.010) 




SCAD 


0.832(0.013) 


0.819(0.008) 


0.726(0.009) 




sLASSO 


0.188(0.007) 


0.164(0.006) 


0.182(0.010) 




sSCAD 


0.169(0.006) 


0.163(0.007) 


0.174(0.008) 



Abbreviations: TPR true positive rate, MCC Matthews correlation coefficient, 
AUC area under the ROC curve, FDR false discovery rate. 

Numbers in each cell represent mean (standard error) by 200 times simulation. 



error using 200 replications. The simulation results 
based on the Tables 2, 3 and 4 are described from the 
following two perspectives. 

(I) . s LASSO/ s SCAD has lower false discovery rate than 
LASSO/SCAD while possessing similar AUC 

It appears that s LASSO and S SCAD have lower FDR for 
identifying interactions in comparison to LASSO or 
SCAD. Contrary to LASSO and SCAD which generated 
unacceptably high FDR in all scenarios, both s LASSO 
and S SCAD controlled FDR at an acceptable level. In 
regards to predictive AUC, there was no difference in 
stability selection procedures that being its inclusion or 
exclusion. In other words, s LASSO or S SCAD achieved a 
higher specificity than other procedures despite the simi- 
lar diagnostic accuracy of AUC. 

(II) . S SCAD has more robust power against s LASSO among 
different interaction models 

Given an acceptable FDR level, we compared S SCAD 
with the s LASSO procedure in the detection of SNP- 
SNP interactions. sLASSO lost its ability to rapidly de- 
tect interactions as the reduction of the main effects 
from the scenarios C1/C2/C3 to scenarios B1/B2/B3 and 
A1/A2/A3. S SCAD, on the other hand, possessed robust 
detecting power under all scenarios. For the scenario 
A1/A2/A3, in which the model only included the SNP- 
SNP interaction without any main effects of SNPs, 
S SCAD was more powerful than s LASSO. An exemplifi- 
cation of this can be seen in scenario Al, in which the 
criteria of measuring the power of variable selection pro- 
cedures echoed the trend: therein the TPR of S SCAD 
was 93.3%, while the one of sLASSO was only roughly 
75.0%. Likewise, MCC and AUC were also both higher 
with S SCAD than with s LASSO. 

The underlying interactions were better detected with 
sLASSO in the scenario CI where the corresponding 
main effects were not too small (Table 2). s LASSO pos- 
sessed slightly higher TPR, MCC and AUC than S SCAD 
in the scenario CI. sSCAD was more powerful than 
s LASSO in the scenario C2/C3 where the corresponding 
main effects ranged from small to moderate (Tables 3 
and 4). 

Generally speaking, the SCAD penalty has an edge 
over LASSO in selection features, namely those where 
the selective features are more consistent with their ac- 
tual effects. The LASSO penalty may introduce more 
false interactions than the SCAD in the sparse high- 
dimensional models. Thus, sLASSO loses more true pos- 
itives than S SCAD when controlling FDR estimation of 
stability selection at the desired level. 

Overall, since the underlying interaction model is gen- 
erally unknown, and a wide range of interaction models 
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Table 5 Empirical selection probability of significant SNP pairs by s LASSO and S SCAD under subsampling 





Trait 






Nanjing Study 




Beijing Study 




Pooled Study 




SNP1(rs) 


Genel 


SNP2(rs) 


Gene2 


a n 




b p 


n 




P 


n 




P 










sLASSO 


S SCAD 


Marginal 


sLASSO 


S SCAD 


Marginal 


sLASSO 


S SCAD 


Marginal 


rs7839119 


PTK2 


rs 12544802 


PTK2 


0.724 c * 


0.702* 


1 .04 x 1 0~ 6 


0.628* 


0.940* 


6.34 X 10~ 2 


0.668* 


0.801* 


3.10 x 10"* 


rs3781626 


RAPSN 


rs60 18348 


SRC 


0.734* 


0.680* 


3.25 X 10~ 3 


0.514* 


0.654* 


2.88 X 10~ 3 


0.617* 


0.824* 


1 .40 X 1 0' A 


rs78391 19 


PTK2 


rs4524871 


MUSK 


0.746* 


0.734* 


7.1 9 X 10~ 4 


0.478 


0.518* 


4.98 X 1 0~ 2 


0.600* 


0.980* 


5.02 X 10~ 4 


rs2736100 


TERT 


rs403 1 8 


PIK3R1 


0.586 


0.830* 


9.87 X 1 0~ 6 


0.390 


0.582* 


2.22 X 10~ 2 


0.696* 


0.977* 


2.51 X 10~ 6 



°n represents the empirical selection probability of SNP pairs under subsampling. 
b p stands for the trend test p-value for simple marginal logistic regression. 

c The significant SNP pairs under stability selection are coded by (*) to indicate its selection probability being higher than the threshold value (implied 
by FDR < 0.1). 



without marginal effects do exist [19], S SCAD is a valu- 
able tool for discovering interactions without main ef- 
fects and complement s LASSO in GWAS. 

Result of application 

Two SNP-SNP interactions were significant in both the 
discovery and replication phases by s LASSO (Table 5), 
and four SNP-SNP interactions were significant in both 
phases of sSCAD selection. S SCAD contained all significant 
interactions identified by s LASSO. When using S SCAD, the 
two pairs (rs7839119-rs4524871 and rs2736100-rs40318) 
were shown to have significant interactions in both the dis- 
covery and replication populations. In contrast, neither of 
two pairs was validated as significant in the replication 
phase with s LASSO. 

For the s LASSO procedure, there were ten significant 
(jii > Tttkni FDR < 0.1) two-way SNP-SNP interactions in 
the Nanjing discovery study (Table 6). Among these ten 
SNP pairs, two were selected (jii > n t h r i, FDR < 0.1) in the 
replication phase (Table 7). Both SNP-SNP interactions 
(rs7839119-rsl2544802 and rs3781626-rs6018348) were 

Table 6 Empirical selection probability of significant SNP 



pairs in Nanjing study by the 


sLASSO 




SNPI(rs) 


Genel 


SNP2(rs) 


Gene2 


a n 

Nanjing study 5 LASSO 


rs929087 


FASLG 


rs 12544802 


PTK2 


0.964 


rs4946933 


FOX03 


rsl 1231 740 


BAD 


0.890 


rs2853462 


CHRNG 


rs7856889 


MUSK 


0.880 


rs7445640 


TERT 


rsl 0733579 


MUSK 


0.824 


rs411751 


PIK3R1 


rs939269 


PTK2B 


0.794 


rs7839119 


PTK2 


rs4524871 


MUSK 


0.746 


rs3781626 


RAPSN 


rs60 18348 


SRC 


0.734 


rs7839119 


PTK2 


rs 12544802 


PTK2 


0.724 


rs725787 


PTK2B 


rs5998196 


YWHAH 


0.688 


rs6578141 


PTK2 


rs 1940245 


MUSK 


0.636 



a H represents a predictor's empirical probability of model inclusion under 500 
times subsampling. 



verified in the replication phase. In an overall analysis 
that included discovery and replication datasets (5,408 
subjects; 2,331 cases and 3,077 controls), the empirical 
selection probabilities of rs7839119-rsl2544802 and 
rs3781626-rs6018348 interactions were 0.668 and 0.617, 
respectively; thus, indicating little statistical significance 
fa > n th „ FDR < 0.1). 

Under the S SCAD procedure, all four significant 
SNP-SNP interactions (rs7839119-rsl2544802, rs378 
1626-rs6018348, rs7839119-rs4524871 and rs2736100- 
rs40318) were successfully replicated (Tables 5, 8, 
and 9). The empirical selection probabilities of rs7839119- 
rsl2544802, rs3781626-rs6018348, rs7839119-rs4524871 
and rs2736100-rs40318 interactions in the overall analysis, 
which included all 5,408 subjects, were 0.801, 0.824, 0.980 
and 0.977, respectively. In turn, these results indicate statis- 
tical significance (jii > n t h„ FDR < 0.1). 

We also included the result of one permuted data set 
from the total 5,408 subjects combined. The selection 
probabilities of the s LASSO and S SCAD were 0.402 and 
0.306, respectively. This corresponds to the maximum 
value of the test statistic for the permutation set. The 
cutoffs obtained from above inequality (6) for s LASSO 
and S SCAD with the significance (FDR < 0.1) were 0.593 
and 0.560, respectively; thus, suggesting that the FDR calcu- 
lated from the Meinshausen and Biilmann theorem is con- 
servative. There appears to be little danger of selecting 
grossly inaccurate FDR when applying the Meinshausen 
and Biilmann theory. 

Discussion 

Identifying interactions among multiple SNPs is both statis- 
tically and computationally challenging in large-scale asso- 
ciation studies. The challenges include high-dimensional 
problems, computational capability, multiple testing prob- 
lems, and genetic heterogeneity [20]. Many stochastic and 
heuristic detecting epistasis methods [21] could be used to 
analyze GWAS dataset. Wang et al. used AntEpiSeeker, a 
two-stage ant colony optimization algorithm (ACO), to 
identify epistasis [22]. Wan et al. proposed SNPRuler [23] 
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Table 7 Empirical selection probability of significant SNP 
pairs in Beijing study by the s LASSO 



SNPI(rs) 


Genel 


SNP2(rs) 


Gene2 


a n 

Beijing study s LASSO 


rs3779632 


PTK2B 


rs9644448 


PTK2 


1.000 


rs2736100 


TERT 


rs 11 994882 


PTK2B 


0.904 


rs1 01 09684 


PTK2 


rs1 1231 735 


BAD 


0.904 


rs 11994882 


PTK2B 


rs4983387 


AKT1 


0.840 


rs6969923 


PIK3CG 


rs1 1997161 


PTK2 


0.788 


rs2677764 


PIK3CA 


rs2821 142 


MUSK 


0.784 


rs4551415 


PTK2 


rs 13597 11 


MUSK 


0.740 


rs1 550099 


CHRNG 


rs 108 17088 


MUSK 


0.706 


rs 12466358 


CHRNG 


rs2565062 


PTK2B 


0.700 


rs3791 723 


CHRNG 


rs78391 1 9 


PTK2 


0.684 


rs78391 19 


PTK2 


rs 12544802 


PTK2 


0.628 


rs9773817 


PTK2B 


rs6018088 


SRC 


0.624 


rs479744 


FOX03 


rs7952435 


BAD 


0.610 


rs2736122 


TERT 


rs1 2945577 


CHRNB1 


0.598 


rs 108 17082 


MUSK 


rs5994451 


YWHAH 


0.596 


rs2S 1 398 


PIK3R1 


rs1 0733579 


MUSK 


0.530 


rs3781626 


RAP5N 


rs60 18348 


SRC 


0.514 


rs4727666 


PIK3CG 


rs7856889 


MUSK 


0.506 



a n represents a predictor's empirical probability of model inclusion under 500 
times subsampling. 



based on both predictive rule inference, and two-stage 
design. Boolean operation-based screening and testing 
(BOOST) [24] involves only Boolean values, and allows the 
use of fast logic operations to obtain contingency tables. 
TEAM [25] exploits properties of test statistics to mitigate 



Table 8 Empirical selection probability of significant SNP 
pairs in Nanjing study by the S SCAD 



SNPI(rs) 


Genel 


SNP2(rs) 


Gene2 


a n 

Nanjing study S SCAD 


rs929087 


FASLG 


rs1 2544802 


PTK2 


0.904 


rs2853462 


CHRNG 


rs7856889 


MUSK 


0.876 


rs2736100 


TERT 


rs40318 


PIK3R1 


0.830 


rs7445640 


TERT 


rs1 0733579 


MUSK 


0.786 


rs411751 


PIK3R1 


rs939269 


PTK2B 


0.756 


rs7839119 


PTK2 


rs4524871 


MUSK 


0.734 


rs7839119 


PTK2 


rs 12544802 


PTK2 


0.702 


rs725787 


PTK2B 


rs5998196 


YWHAH 


0.692 


rs3781626 


RAPSN 


rs60 18348 


SRC 


0.680 


rs4946933 


FOX03 


rs1 1231740 


BAD 


0.640 


rs6578141 


PTK2 


rs 1940245 


MUSK 


0.604 



a TI represents a predictor's empirical probability of model inclusion under 500 
times subsampling. 



Table 9 Empirical selection probability of significant SNP 
pairs in Beijing study by the S SCAD 

SNP1(rs) Genel SNP2(rs) Gene2 a f7 



Beijing study S SCAD 


rs7839119 


PTK2 


rs 12544802 


PTK2 


0.940 


rs3779632 


PTK2B 


rs9644448 


PTK2 


0.828 


rsl 051 5077 


PIK3R1 


rsl 081 7088 


MUSK 


0.658 


rs3781626 


RAPSN 


rs60 18348 


SRC 


0.654 


rs2736122 


TERT 


rsl 2945577 


CHRNB1 


0.610 


rs3639 


PTK2 


rs3781626 


RAPSN 


0.602 


rs3791723 


CHRNG 


rs7839119 


PTK2 


0.594 


rs2736100 


TERT 


rs40318 


PIK3R1 


0.582 


rs9773817 


PTK2B 


rs60 18088 


SRC 


0.574 


rs3800230 


FOX03 


rs7856889 


MUSK 


0.568 


rs 109805 10 


MUSK 


rs3829603 


CHRNB1 


0.560 


rs2677764 


PIK3CA 


rs2853668 


TERT 


0.556 


rs41 1 751 


PIK3R1 


rs9609396 


YWHAH 


0.526 


rs9480867 


FOX03 


rsl 1231 741 


BAD 


0.522 


rs7839119 


PTK2 


rs4524871 


MUSK 


0.518 


rs4524871 


MUSK 


rs 10980564 


MUSK 


0.502 



a Tl represents a predictor's empirical probability of model inclusion under 500 
times subsampling. 



multiple testing problems. To date, there appears to be no 
one method free from model sensitivity. 

In addition to non-parametric and model-free methods, 
many LASSO-based penalized parametric methods provide 
the estimation of parameter as the dimensionality in- 
creases, even if the number of variables is greater than the 
sample size. The coefficients of those none disease- 
associated SNPs will be zero in the penalized multivariate 
regression model. Thus, detecting interactions is equivalent 
with the variable selection problem under the framework 
of regression analysis. A broad range of simulations has 
demonstrated that the penalized regression method is a 
useful tool for detecting gene-gene interactions. However, 
the regularization choice in penalized regression is usually 
made by cross-validation that maximizes predictive accur- 
acy in finite samples; although it does not necessarily in- 
duce the correct sparseness pattern for variable selection 
[26] . In our simulations, cross-validation often leads to the 
inclusion of too many noise variables, and induces instabil- 
ity of variable selection for the ordinary penalized regres- 
sion method, such as LASSO or SCAD. A major hurdle 
for studying interactions in GWAS is the lack of efficient 
algorithms that can map different forms of interactions 
while keeping FDR under control [27] . s LASSO introduces 
stability selection into traditional LASSO. The stability se- 
lection procedure combines selection algorithms for high 
dimensional problems by sub-sampling. s LASSO dramatic- 
ally reduces the number of false discovery rate, and could 
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accurately identify crucial regions of GWAS; however, it is 
overly conservative, and may miss some important regions. 

S SCAD procedure increases the power of detecting the 
interactions while controlling FDR. It attempts to pro- 
vide more true interactions, but less noise terms than 
s LASSO. The above advantage could be attributed to the 
fact that running the LASSO-penalized procedure within 
stability selection results in more false positives than 
SCAD for each random sub-sample. Thus, the interac- 
tions causing noise as well as true interactions in the 
region both satisfy the threshold condition Ji t h r for selec- 
tion. To control the number of falsely selected variables, 
the threshold must be very stringent. As a result, the 
sLASSO selection suffers a loss of power. 

We analyzed a previously reported lung cancer data- 
set in Han Chinese, and confirmed significant interac- 
tions in the achPathway pathway, which supported the 
appropriate use of the proposed method. The observa- 
tion of interactions between two closely located SNP 
pairs supports the hypothesis that some genetic vari- 
ation in complex traits may hide in interactions be- 
tween linked SNPs [28]. 

Application of the proposed procedure to GWAS data 
may ensure that the power of detection is reduced when 
over-stringent threshold Ji t hr conditions exist for the 
much increased ratio of SNPs to samples. A good alter- 
native to derive genome-wide significant threshold is 
permutation. Unfortunately, genome-wide permutation 
in real GWAS of interactions is computationally prohibi- 
tive for the sSCAD selection. Partial search strategies 
based on biological knowledge [29] or the filtering of un- 
important SNPs prior to analysis [30] could be adopted 
to reduce excessive computing burdens in early stage of 
genome-wide scale. These strategies are also necessary 
for the proposed method. 

Under our current approach, high-dimensional data 
were primarily managed with sparse models. High corre- 
lations (individual SNPs that have a variance inflation 
factor (VIF) > 2 with other markers) were excluded. The 
chip data were pruned, and then analyzed with regres- 
sion model method using a sparse constraint. Many 
common diseases may be associated with many SNPs 
with small to moderate effects. In this situation, we are 
considering group penalized methods in another paper. 

Conclusions 

We developed a variable selection procedure (referred to 
as S SCAD selection). This procedure could control the 
FDR while maintaining the power to detect SNP-SNP in- 
teractions in association studies. In the pure interaction 
model, this procedure seems to overcome the conserva- 
tiveness of sLASSO. The end result is that sSCAD, as a 
new technique in detecting interactions, can benefit the 
selection of s LASSO. 
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